Introduction

As required, this task was open-ended, so each group had to choose a specific topic on its own. Our group chose a dataset we found at https://labrosa.ee.columbia.edu/millionsong/pages/getting-dataset#subset. This subset contains 10,000 songs and is about 2 GB in size. The full dataset is about 300 GB and has around one million entries, each of them a song. Besides the audio analysis, the dataset includes metadata such as artist and year of release, and music data features for each song, all stored in HDF5 format. The original provider of the data is The Echo Nest (http://the.echonest.com), which was a music intelligence and data platform for developers until Spotify (https://www.spotify.com/de/), a well-known music streaming provider, acquired it [1]. According to the dataset's documentation, it is the result of a collaboration between The Echo Nest and LabROSA (https://labrosa.ee.columbia.edu) [2].

Our goal in this project is to analyze a selection of songs that we like. Since every track is labeled with artist name, song title and year of release, almost every song can be found either on YouTube (https://www.youtube.com) or on Spotify (https://www.spotify.com/de/). First, we listen to some of the songs to find the ones we prefer. Next, we analyze those songs to understand which data describe our preferences. Finally, we are going to use Spotify for prediction. By restricting the analysis to songs we know, we hope for a better analysis and understanding of the given data; otherwise we would mostly be comparing very different data that are not suitable for our purposes.

Alongside this analysis, we also want some more general information about the artists and their songs, so we visualize that as well.

Handle the downloaded data

After downloading and unzipping the data, one finds two folders: ‘data’, which contains several nested folders, and ‘AdditionalFiles’, which contains additional files in either SQL or txt format. The directory structure is based on The Echo Nest track IDs [3]. The ‘data’ folder contains exclusively song files in HDF5 (Hierarchical Data Format 5) format. This format is mostly used in scientific applications for big datasets. It was originally developed at the National Center for Supercomputing Applications (NCSA) and is used by NASA [4], among others, to handle large, heterogeneous and hierarchical datasets. Each file holds analysis data, metadata and further information taken from MusicBrainz (https://musicbrainz.org), an open music encyclopedia. The files in ‘AdditionalFiles’ are used for a first hands-on look at the whole dataset, since they are simple to access. This gives us some general information about the dataset. Reading both folders requires some additional packages, which are mentioned later on.
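
The directory layout described above can be sketched in a few lines of R. The helper name track_path and the root folder are assumptions made for illustration; only the path pattern itself is taken from the example path given in footnote 3.

```r
# Hypothetical helper (not part of the dataset tools): build the expected
# location of a track's HDF5 file from its Echo Nest track ID.
track_path <- function(root, track_id) {
  # characters 3 to 5 of the ID name the three nested folders
  folders <- substring(track_id, 3:5, 3:5)
  file.path(root, "data", folders[1], folders[2], folders[3],
            paste0(track_id, ".h5"))
}

track_path("MillionSong", "TRADHRX12903CD3866")
# "MillionSong/data/A/D/H/TRADHRX12903CD3866.h5"
```

Building the path directly like this avoids a recursive search through all folders when the track ID is already known.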

For more information about the dataset, especially the frequently asked questions, we recommend https://labrosa.ee.columbia.edu/millionsong/faq.

Preprocess the Additional files

Before the files in the ‘AdditionalFiles’ folder can be read, the separator <SEP> has to be replaced with a common single-character separator such as ‘;’. This is necessary because R's read.csv family expects a one-byte separator and therefore cannot read a file that uses <SEP>.
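
One possible way to do this replacement inside R is sketched below. The helper name replace_sep and the demo record are made up for illustration and are not taken from the dataset; the sketch also assumes that no field already contains a ‘;’, which would otherwise be read as an extra column.

```r
# Hypothetical helper: rewrite a file so that every "<SEP>" becomes `sep`.
replace_sep <- function(in_file, out_file, sep = ";") {
  lines <- readLines(in_file, warn = FALSE)
  # fixed = TRUE treats "<SEP>" as a literal string, not as a regex
  writeLines(gsub("<SEP>", sep, lines, fixed = TRUE), out_file)
  invisible(out_file)
}

# Small self-contained demonstration with a made-up record
src <- tempfile(fileext = ".txt")
dst <- tempfile(fileext = ".txt")
writeLines("ID1<SEP>ID2<SEP>Some Artist<SEP>Some Song", src)
replace_sep(src, dst)
demo <- read.csv2(dst, sep = ";", header = FALSE, stringsAsFactors = FALSE)
```

After the conversion, the file can be read with read.csv2() like any other semicolon-separated file.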

The following code chunk was only used to read the txt files in RStudio. This is done with read.csv2().

location <- read.csv2('data/subset_artist_location.txt',sep = ';', header = FALSE, col.names = c('artistId', 'lat','lon',  'trackID', 'artistName'))

artists <- read.csv2('data/subset_unique_artists.txt',sep = ';', header = FALSE, col.names = c('artistId', 'V2', 'trackID', 'artistName'))

tags <- read.csv2('data/subset_unique_mbtags.txt',sep = ';', header = FALSE, col.names = c('tags'))

uni_terms <- read.csv2('data/subset_unique_terms.txt',sep = ';', header = FALSE, col.names = c('terms Unique' ))

tracks <- read.csv2('data/subset_unique_tracks.txt',sep = ';', header = FALSE, col.names = c('trackID','V2', 'artistName','songName'))

tracksPerYear <- read.csv2('data/subset_tracks_per_year.txt',sep = ';', header = FALSE,  col.names = c('Year', 'trackID', 'artistName','songName'))

General information visualized

The following code loads the packages that are required to make a word cloud. The first word cloud we created was dominated by the most common words of the English language. We therefore removed these words from the data, based on our findings and the Wikipedia page (https://en.wikipedia.org/wiki/Most_common_words_in_English). Specifically, we removed ‘the’, ‘and’ and ‘a’.

# Load packages
library("NLP")
library("tm") # for text mining
library("SnowballC") # for text stemming
library("RColorBrewer") # color palettes
library("wordcloud") # word-cloud generator 

docs <- Corpus(VectorSource(as.String(artists$artistName)))

# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))

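others <- c('the','and','a')

# Replace every occurrence of a pattern with a space. gsub() matches
# substrings, so e.g. 'a' is also removed inside longer words.
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
for (i in seq_along(others)){
  docs <- tm_map(docs, toSpace, others[i])
}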

dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)

wordcloud(words = d$word, freq = d$freq, min.freq = 1, scale = c(3,0.2),
          max.words=200, random.order=FALSE, rot.per=0.35, colors=brewer.pal(8, "Dark2"))
mtext('Artistnames', side = 2, line = 1, adj = 0.5) # title

head(d,8)
##              word freq
## john         john   41
## orchestr orchestr   38
## los           los   31
## vid           vid   31
## turing     turing   25
## joe           joe   21
## bro           bro   19
## king         king   19

Looking at the plot above, one can see that the most common words in artist names are ‘orchestr’ and ‘john’. Truncated tokens such as ‘orchestr’, ‘vid’, ‘turing’ and ‘bro’ are a side effect of the cleaning step, which removes ‘the’, ‘and’ and ‘a’ even inside longer words; they probably stem from ‘orchestra’, ‘david’, ‘featuring’ and ‘brothers’. There are also Spanish words like ‘los’, so the dataset contains not only English-speaking but also Spanish-speaking artists. Other names such as Joe or King are quite common as well. To support the interpretation of the word cloud, the actual frequencies of the most frequent entries are given in the table. Together, the table and the word cloud give a better picture of the distribution of artist names in the dataset. According to Wikipedia (https://en.wikipedia.org/wiki/List_of_most_popular_given_names#Male_names_2), John was one of the most common given names in the 1990s.
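
The cleaning loop used for the word cloud replaces substrings, which is why tokens such as ‘orchestr’ show up in the table. A minimal sketch of a whole-word alternative in base R follows; the helper name remove_words is hypothetical, and this variant was not used to produce the results in this report.

```r
# Hypothetical helper: remove the given words only where they appear as
# whole words, then collapse the leftover whitespace.
remove_words <- function(x, words) {
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  cleaned <- gsub(pattern, " ", tolower(x), perl = TRUE)
  trimws(gsub("\\s+", " ", cleaned))
}

# Substring replacement damages longer words:
gsub("a", " ", "orchestra", fixed = TRUE)
# "orchestr "

# Whole-word replacement leaves them intact:
remove_words("The Orchestra and a Band", c("the", "and", "a"))
# "orchestra band"
```

The tm package also offers removeWords() for the same purpose; the base R version is shown here only to keep the sketch free of dependencies.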

tracksPerYear$artistName[tracksPerYear$Year >= 1990 & tracksPerYear$Year <= 2000]
##  [1] K's Choice            K's Choice            Kaija Koo            
##  [4] Kisha                 Lee Ritenour          Les Malpolis         
##  [7] Lisa Lynne            Los Amigos Invisibles Los Amigos Invisibles
## [10] Luciana Souza         M.A. Numminen         Mandi                
## [13] Martin Sexton         Martin Sexton         Mithotyn             
## [16] Mithotyn              Monster Magnet        Moonspell            
## [19] Mudhoney              Natural Elements      Nic Endo             
## [22] Old Man's Child       OutKast              
## 1149 Levels: !!! 2 Minutos 2-4 Grooves feat. Reki D. ... Zombina & The Skeletones

After displaying the artist names of the entries between 1990 and 2000, the assumption made before has to be rejected: no artist named John appears in the displayed output. However, another common word, ‘Los’, does show up in the displayed subset. We cannot draw a firm conclusion from this; the result should be read as a description of the given dataset without further evidence. One statement still holds: after cleaning, the words in the artist names are distributed as shown by the word cloud and the frequency table.
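
Instead of reading the printed list by eye, the check can be made explicit by counting the names that contain a given word. The sketch below uses a hypothetical helper count_word and a small sample vector; four of the names are copied from the output above and ‘John Doe’ is a made-up entry added for contrast.

```r
# Hypothetical helper: count names that contain `word` as a whole word.
count_word <- function(names, word) {
  pattern <- paste0("\\b", word, "\\b")
  sum(grepl(pattern, names, ignore.case = TRUE, perl = TRUE))
}

sample_names <- c("K's Choice", "Los Amigos Invisibles", "Martin Sexton",
                  "OutKast", "John Doe")
count_word(sample_names, "john")  # 1
count_word(sample_names, "los")   # 1

# With the real data one would pass, for example,
# tracksPerYear$artistName[tracksPerYear$Year >= 1990 & tracksPerYear$Year <= 2000]
```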

We ran almost the same analysis on common song names. The common words in this case were not quite the same as before. To find them, we first plotted the word cloud on the uncleaned data. Removing the words found there, ‘the’, ‘version’, ‘and’, ‘from’, ‘feat’ and ‘album’, produced the following word cloud. We cut these words out of the dataset because they carry little information and are common words in the English language.

# Load packages
library("NLP")
library("tm") # for text mining
library("SnowballC") # for text stemming
library("RColorBrewer") # color palettes
library("wordcloud") # word-cloud generator 

docs1 <- Corpus(VectorSource(as.character(tracks$songName)))

# Convert the text to lower case
docs1 <- tm_map(docs1, content_transformer(tolower))

others <- c('the','version','and','from', 'feat','album')
# gsub() matches substrings, so these patterns are also removed inside longer words
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
for (i in seq_along(others)){
  docs1 <- tm_map(docs1, toSpace, others[i])
}

dtm <- TermDocumentMatrix(docs1)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)

wordcloud(words = d$word, freq = d$freq, min.freq = 1, 
          max.words=100, random.order=FALSE, rot.per=0.35, colors=brewer.pal(8, "Dark2"))
mtext('Songnames', side = 3, line = 0, adj = 0.5) # title

head(d,7)
##      word freq
## you   you  540
## love love  332
## live live  216
## for   for  185
## all   all  144
## your your  143
## don   don  137

Looking at the result, the most frequent words are ‘you’ and ‘love’. This suggests that the song titles in this dataset mostly deal with love and with the person being addressed, ‘you’. One could generalize that more songs are about love, a counterpart and life than about, for example, technology or traveling, although ‘live’ may also simply mark live recordings. However, this assumption cannot be fully proven, since the dataset does not represent all song names in the world.

library(maps)
library(mapdata)
#library(eurostat)

# parse the lat and lon values of given set 
lon <- as.double(as.character(location$lon))
lat <- as.double(as.character(location$lat))

# keep only rows where both coordinates are available, so that
# lon and lat stay aligned
keep <- !is.na(lon) & !is.na(lat)
coordinates <- data.frame(lon = lon[keep], lat = lat[keep])

# take a closer look at europe 
#europe <- as.data.frame(cbind(lon = c(54.78333, 24.08464, -31.26192, 59.34569), lat = c(80.56667, 34.83469, 39.45479, 62.21215)))

map('world',c('.')) 
points(coordinates$lon, coordinates$lat, col = "red", cex = .1)

#x <- map('world', xlim = range(europe$lon), ylim = range(europe$lat), namefield = TRUE)
#x$names <- gsub("\\:.*","",x$names)

map(col = "grey80", border = "grey40", fill = TRUE,
  xlim = c(-25, 45), ylim = c(36, 70), mar = rep(0.1, 4))
points(coordinates$lon, coordinates$lat, col = "red", cex = .3)


# install rhdf5 once from Bioconductor:
#install.packages("BiocManager")
#BiocManager::install("rhdf5")
library(rhdf5) # required for H5 files

# hardcoded path to the MillionSongSubset
pathToSet <- '/Users/Kostja/Desktop/Master/Sem 2 (18 SoSe)/Data Visualization/Tasks/MillionSongSubset'

# track IDs of the preferred songs, looked up beforehand
TrackIDs <- array(c('TRAPZTV128F92CAA4E','TRANNZZ128F92C22F7','TRAQZQX128F931338F','TRALONM128EF35A199','TRAWBHE12903CBC4CB'))

# automatically find all file paths that contain one of the track IDs
SubPaths <- lapply(TrackIDs,function(x){
  list.files(pathToSet, x, recursive=TRUE, full.names=TRUE, include.dirs=TRUE)
})

# collect the paths in a one-row data frame with one named column per artist
SubPaths <- data.frame(SubPaths = t(unlist(SubPaths)))
names(SubPaths) <- c('beyonce', 'justin', 'kanye', 'madonna', 'bruno')



# read the H5 files and create a readable output
artist <- lapply(SubPaths, function(x){
  h5ls(toString(x))
})



Analyze_song <- apply(SubPaths,2,function(x){
  h5read(x,"/analysis/songs")
})
Analyze_song <- do.call(rbind, Analyze_song)

Meta_song <- apply(SubPaths,2,function(x){
  h5read(x,"/metadata/songs")
})
Meta_song <- do.call(rbind, Meta_song)

library(fmsb)

# Build the data frame for the radar chart from the metadata (df1)
# and the analysis (df2) tables.
radarFrame <- function(df1, df2){
  frame <- cbind('artist_familiarity' = df1$artist_familiarity, 'artist_hotttnesss' = df1$artist_hotttnesss, 'tempo'= df2$tempo, 'time_signature' = df2$time_signature, 'loudness' = df2$loudness, 'key' = df2$key) 
  rownames(frame) <- rownames(df1)
  data.frame(frame)
}

namesLegend <- paste(Meta_song$artist_name,Meta_song$title)

# Draw a radar chart with one polygon per row of df and a legend at (x, y).
radar <- function(df, namesLeg = namesLegend, x = -2.8 , y= -1.1){
  transparency <- adjustcolor(1:dim(df)[1], alpha.f = 0.2) 
  radarchart( df  , axistype=1 , maxmin = FALSE,
    # polygon style
    pcol=1:dim(df)[1], plwd=1 , pfcol = transparency ,
    # grid style
    cglcol="grey", cglty=1, axislabcol=FALSE ,
    # label size
    vlcex=0.8 
    )
  par(xpd=TRUE)
  legend(x,y, legend = namesLeg, bty = "n", pch=20 , col=1:dim(df)[1] , cex=0.8, pt.cex=2)
}

data <- radarFrame(Meta_song, Analyze_song)

radar(data)

# values to look at for the radar chart:
# artist familiarity is found under metadata
# the hotttnesss values are estimates computed by The Echo Nest, so they are hard to interpret in absolute terms
# tempo (in songs): compare with another site, because it is not quite correct
# time signature (in songs): also compare with another site. Both values come from the same dataset and therefore carry the same error; if a different dataset is added, the error may no longer be reproducible and the bias may be completely distorted, so that we can no longer make any statement.
# loudness (in songs)
# key (in songs)

# for everything above, take data from another site and create a radar plot for comparison

# loudness max as a more detailed value
compareFrame <- data.frame(rbind(
  beyonce = c('familiarity' = 70, 'tempo' = 97, 'time_signature' = 4, 'loudness' = -5,'key' = 1),
  justin = c('familiarity' = 70, 'tempo' = 76, 'time_signature' = 4, 'loudness' = -5,'key' = 7),
  kanye = c('familiarity' = 65, 'tempo' = 106, 'time_signature' = 4, 'loudness' = -5,'key' = 9),
  madonna = c('familiarity' = 54, 'tempo' = 119, 'time_signature' = 4, 'loudness' = -7,'key' = 9),
  bruno = c('familiarity' = 70, 'tempo' = 104, 'time_signature' = 4, 'loudness' = -6,'key' = 10)
))

# because all time signatures are 4, this axis cannot be drawn properly:
# radarchart scales each axis relative to the range of its values
radar(compareFrame)

# not really comparable, as seen
par(mfrow = c(1,2))
radar(data,x=-2.2, y = -1.2)
radar(compareFrame,x=-2.2)

par(mfrow = c(1,1))
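
The comments in the chunk above note that radarchart draws each axis relative to the data, which makes two charts hard to compare. One way around this, sketched here under assumptions, is to pass explicit axis limits: with maxmin = TRUE, fmsb's radarchart reads the maximum of each axis from row 1 and the minimum from row 2. The helper name with_axis_range and the example values are hypothetical.

```r
# Hypothetical helper: prepend a "max" and a "min" row taken from a
# reference frame, so that several charts can share the same axis limits.
with_axis_range <- function(df, ref = df) {
  upper <- sapply(ref, max, na.rm = TRUE)
  lower <- sapply(ref, min, na.rm = TRUE)
  # widen constant columns slightly so the axis has a non-zero span
  flat <- upper == lower
  upper[flat] <- upper[flat] + 1
  lower[flat] <- lower[flat] - 1
  out <- rbind(as.data.frame(t(upper)), as.data.frame(t(lower)), df)
  rownames(out) <- c("max", "min", rownames(df))
  out
}

# Made-up example values
example <- data.frame(tempo = c(97, 76), time_signature = c(4, 4),
                      row.names = c("a", "b"))
scaled <- with_axis_range(example)

# Possible use (requires the fmsb package):
# fmsb::radarchart(scaled, maxmin = TRUE)
```

Using the same reference frame for both charts would give them identical axes, at the cost of choosing the limits by hand when the two sources use different scales.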

Track IDs of the selected songs: beyonce TRAPZTV128F92CAA4E, justin TRANNZZ128F92C22F7, kanye TRAQZQX128F931338F, madonna TRALONM128EF35A199, bruno mars TRAWBHE12903CBC4CB.

# library(fmsb)
# Tune_Beyance
# Tune_Justin <- c(,,76,,-5,8)
# Tune_Kanye
# Tune_Bruno
# Tune_Madonna

loudness_start <- apply(SubPaths,2,function(x){
  h5read(x,"/analysis/segments_loudness_start")
})

loudness_max <- apply(SubPaths,2,function(x){
  h5read(x,"/analysis/segments_loudness_max")
})

par(mfrow= c(1,2))
boxplot(loudness_start, main = 'loudness_start' )
boxplot(loudness_max, main = 'loudness_max' )
mtext('Boxplots of loudness', outer = TRUE, side = 3, line = -1)

par(mfrow= c(1,1))


Draw_matrix_plots <- function(plt){
  label <- deparse(substitute(plt))
  # three plots in the first row, two centered plots in the second row
  layout(matrix(c(1,1,2,2,3,3,0,4,4,5,5,0), 2, byrow = TRUE), heights=c(2,2))
  for (i in seq_along(plt)){
    plot(plt[[i]],type = 'l', axes = FALSE, xlab = '', ylab = '', main = names(plt)[i])
    axis(2)
    axis(1)
  }
  mtext(paste('Plot', label,'for different artists' ), side = 3, line = -19, outer = TRUE)
  par(mfrow=c(1,1))
}

Draw_matrix_plots(loudness_start)

Draw_matrix_plots(loudness_max)

matplot_Draw <- function(plt){
  label <- deparse(substitute(plt))
  # The songs have different numbers of segments. Pad the shorter ones with
  # NA, because cbind() would otherwise recycle their values (and warn).
  n <- max(lengths(plt))
  dFrame <- do.call(cbind, lapply(plt, function(v) c(v, rep(NA, n - length(v)))))
  matplot(dFrame,type = "l", col = 1:dim(dFrame)[2], ylab = "loudness", xlab = 'segmentstep', main = paste('matplot', label))
  legend("topleft", legend = names(plt), col = 1:dim(dFrame)[2], pch = 16)
}

matplot_Draw(loudness_start)

matplot_Draw(loudness_max)

# not sure about this one
Analyze_pitch <- apply(SubPaths,2,function(x){
  h5read(x,"/analysis/segments_pitches")
})
boxplot(Analyze_pitch)

Analyze_timbre <- apply(SubPaths,2,function(x){
  h5read(x,"/analysis/segments_timbre")
})

boxplot(Analyze_timbre)

Conclusion

The H5 data explained: https://labrosa.ee.columbia.edu/millionsong/pages/example-track-description

Sources

European map limits: http://www.milanor.net/blog/maps-in-r-introduction-drawing-the-map-of-europe/ Comparison sites: http://www.findsongtempo.com and http://www.tunebat.com


  1. https://en.wikipedia.org/wiki/The_Echo_Nest (accessed 02.06.18)

  2. https://labrosa.ee.columbia.edu/millionsong/ (accessed 02.06.18)

  3. Track IDs have the form TR + letters + letters and numbers. The directory path within the dataset is built from the three letters that follow the ‘TR’ prefix, e.g. ‘MillionSong/data/A/D/H/TRADHRX12903CD3866.h5’.

  4. National Aeronautics and Space Administration, https://www.nasa.gov/about/index.html (accessed 02.06.18)